Density estimation in R

Authors

  • Henry Deng
  • Hadley Wickham
Abstract

Density estimation is an important statistical tool, and within R there are over 20 packages that implement it: so many that it is often difficult to know which to use. This paper presents a brief outline of the theory underlying each package, as well as an overview of the code and a comparison of speed and accuracy. We focus on univariate methods, but include pointers to other more specialised packages. Overall, we found ASH and KernSmooth to be excellent: they are both fast, accurate, and well-maintained.

1 Motivation

There are over 20 packages that perform density estimation in R, varying in both theoretical approach and computational performance. Users and developers who require density estimation tools have different needs, and some methods of density estimation may be more appropriate than others. This paper aims to summarise the existing approaches to make it easier to pick the right package for the job.

We begin in Section 2 with a brief review of the theory behind the main approaches to density estimation, providing links to the relevant literature. In Section 3, we describe the R packages that implement each approach, highlighting the basic code needed to run their density estimation function and listing differences in features (dimensionality, bounds, bandwidth selection, etc). Section 4 compares the calculation speed of each package, looking at density estimation computations from 10 to 10 million observations. The accuracy of the density estimates is also important: Section 5 compares it using three distributions of varying difficulty, the uniform, normal and claw distributions. Section 6 investigates the relationship between calculation time and accuracy, and we conclude in Section 7 with our findings and recommendations.

2 Theoretical approaches

Density estimation builds an estimate of some underlying probability density function using an observed data sample.
Density estimation can be either parametric, where the data is from a known family, or nonparametric, which attempts to flexibly estimate an unknown distribution. We begin with a brief overview of the underlying theory, focusing on nonparametric methods because of their generality. Common methods include histograms (Section 2.1), kernel methods (Section 2.2), and penalized approaches (Section 2.3). We attempt to give the flavor of each method, without going into too much detail. For a more in-depth treatment, we recommend Scott (1992a) and Silverman (1986).

We will assume that we have $n$ iid data points, $X_1, X_2, \ldots, X_n$, and we are interested in an estimate, $\hat{f}(x)$, of the true density, $f(x)$, at a new location $x$.

2.1 Histogram

The histogram (Silverman, 1986) is the oldest (dating to the 1840s (Friendly, 2005)) and least sophisticated method of density estimation. Its main advantages are its extreme simplicity and speed of computation. A histogram is piecewise constant (hence not at all smooth) and can be extremely sensitive to the choice of bin origin.

A simple enhancement to the histogram is the average shifted histogram (ASH): it is smoother than the histogram and avoids sensitivity to the choice of origin, but is still computationally efficient. The premise of this approach (Scott, 1992b) is to take $m$ histograms, $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_m$, of bin width $h$ with origins of $t_0 = 0, \frac{h}{m}, \frac{2h}{m}, \ldots, \frac{(m-1)h}{m}$. As the name suggests, the naive ASH is simply

$$\hat{f}_{\mathrm{ash}}(x) = \frac{1}{m} \sum_{i=1}^{m} \hat{f}_i(x)$$

There are $k = 1, \ldots, m \cdot n$ bins across all histograms, each spanning $\left[k \frac{h}{m}, (k+1) \frac{h}{m}\right]$ with center $(k + 0.5) \frac{h}{m}$. The ASH can be made somewhat more general by using all bins to estimate the density at each point, weighting bins closer to the data more highly. The general form of the weighted ASH is:

$$\hat{f}_{\mathrm{ash}}(x) = \frac{1}{m} \sum_{k=1}^{m \cdot n} w(l_k - x)\, \hat{c}_k(x)$$

where $w$ is a weighting function, $l_k$ is the center of bin $k$, and $\hat{c}_k$ is the number of points in that bin.
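The naive ASH described above can be sketched in a few lines. The following is a minimal illustration in plain Python rather than the code of any R package; the function names `histogram_density` and `naive_ash` are hypothetical, and the sketch evaluates each of the m shifted histograms directly instead of using the faster rolling-sum formulation.

```python
import math


def histogram_density(data, x, h, origin):
    """Histogram density estimate at x, for bin width h and a given bin origin."""
    # Index of the bin that contains x, counted from the origin.
    target_bin = math.floor((x - origin) / h)
    # Count the observations that fall into the same bin as x.
    count = sum(1 for xi in data if math.floor((xi - origin) / h) == target_bin)
    # Normalise by n * h so the histogram integrates to one.
    return count / (len(data) * h)


def naive_ash(data, x, h, m):
    """Average of m histograms whose origins are 0, h/m, ..., (m-1)h/m."""
    origins = [i * h / m for i in range(m)]
    return sum(histogram_density(data, x, h, t) for t in origins) / m


if __name__ == "__main__":
    sample = [0.1, 0.4, 0.45, 0.9]
    # m = 1 is an ordinary histogram; larger m averages over shifted origins.
    print(naive_ash(sample, 0.3, h=0.5, m=1))
    print(naive_ash(sample, 0.3, h=0.5, m=2))
```

With m = 1 the function reduces to a single histogram, so comparing m = 1 against larger m shows how averaging over origins changes the estimate at a fixed point.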
Because the univariate ASH is piecewise constant, it can be computed by taking a histogram with $m \cdot n$ bins and computing a rolling sum over $m$ adjacent bins. This makes the ASH extremely fast to compute.

2.2 Kernel density estimation

The kernel density estimation approach overcomes the discreteness of the histogram approaches by centering a smooth kernel function at each data point, then summing to get a density estimate. The basic kernel estimator can be expressed as

$$\hat{f}_{\mathrm{kde}}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
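As a rough illustration of the kernel estimator, the sketch below centers a Gaussian kernel at each data point and sums the contributions. It is plain Python, not the implementation of any R package. The Gaussian kernel is an assumed choice, the names `gaussian_kernel` and `kde` are hypothetical, and the sketch divides by n·h (the usual normalisation) so that the estimate integrates to one.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(data, x, h):
    """Kernel density estimate at x with bandwidth h.

    Sums a kernel centered at each observation; dividing by n * h is an
    assumption of this sketch that keeps the total area equal to one.
    """
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)


if __name__ == "__main__":
    sample = [-1.0, 0.0, 2.0]
    # Evaluate the estimate on a small grid of locations.
    for x in (-1.0, 0.0, 1.0, 2.0):
        print(x, kde(sample, x, h=0.5))
```

A smaller bandwidth h gives a spikier estimate that follows individual observations, while a larger h gives a smoother one.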




Publication date: 2014